Goto

Collaborating Authors

 error correction


Synthetic Error Injection Fails to Elicit Self-Correction In Language Models

Wu, David X., Kapur, Shreyas, Sahai, Anant, Russell, Stuart

arXiv.org Artificial Intelligence

Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.


CodeVaani: A Multilingual, Voice-Based Code Learning Assistant

Havare, Jayant, Tamilselvam, Srikanth, Mittal, Ashish, Thorat, Shalaka, Jadia, Soham, Apte, Varsha, Ramakrishnan, Ganesh

arXiv.org Artificial Intelligence

Programming education often assumes English proficiency and text-based interaction, creating barriers for students from multilingual regions such as India. We present CodeVaani, a multilingual speech-driven assistant for understanding code, built into Bodhitree [1], a Learning Management System developed at IIT Bombay. It is a voice-enabled assistant that helps learners explore programming concepts in their native languages. The system integrates Indic ASR, a codeaware transcription refinement module, and a code model for generating relevant answers. Responses are provided in both text and audio for natural interaction. In a study with 28 beginner programmers, CodeVaani achieved 75% response accuracy, with over 80% of participants rating the experience positively. Compared to classroom assistance, our framework offers ondemand availability, scalability to support many learners, and multilingual support that lowers the entry barrier for students with limited English proficiency. The demo will illustrate these capabilities and highlight how voice-based AI systems can make programming education more inclusive. Supplementary artifacts and demo video are also made available.


A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Ahmed, Farzad, Jerome, Joniel Augustine, Yetisgen, Meliha, Uzuner, Özlem

arXiv.org Artificial Intelligence

Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.


ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Lin, Ye Bhone, Aung, Thura, Thu, Ye Kyaw, Oo, Thazin Myint

arXiv.org Artificial Intelligence

Abstract--This paper investigates sequence-to-sequence T ransformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IP A and alignment information. T o our knowledge, this is the first study addressing ASR error correction specifically for Burmese. W e evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word-and character-level accuracy over baseline outputs. The proposed AEC model, combining IP A and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improving chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.


Continual Error Correction on Low-Resource Devices

Paramonov, Kirill, Ozay, Mete, Mystakidis, Aristeidis, Tsalikidis, Nikolaos, Sotos, Dimitrios, Drosou, Anastasios, Tzovaras, Dimitrios, Kim, Hyunjun, Chang, Kiseok, Mo, Sangdok, Kim, Namwoong, Yoo, Woojong, Moon, Jijoong, Michieli, Umberto

arXiv.org Artificial Intelligence

The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.


"When Data is Scarce, Prompt Smarter"... Approaches to Grammatical Error Correction in Low-Resource Settings

De, Somsubhra, Kumar, Harsh, A, Arun Prakash

arXiv.org Artificial Intelligence

Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task--ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.


ChineseErrorCorrector3-4B: State-of-the-Art Chinese Spelling and Grammar Corrector

Tian, Wei, YuhaoZhou, null

arXiv.org Artificial Intelligence

This paper introduces ChineseErrorCorrector3-4B, a unified model for Chinese spelling and grammatical error correction based on Qwen3-4B. The model demonstrates outstanding performance in general text correction tasks and achieves state-of-the-art results in both spelling correction (CSC) and grammatical correction (CGC). On several authoritative benchmark datasets -- including SIGHAN-2015, EC-LAW, MCSC, and NaCGEC -- the model's F1 and F0.5 scores significantly surpass existing publicly available models, ranking first in both spelling and grammatical error correction tasks.



IndicGEC: Powerful Models, or a Measurement Mirage?

Vajjala, Sowmya

arXiv.org Artificial Intelligence

In this paper, we report the results of the TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task https://github.com/BHASHA-Workshop/IndicGEC2025/ for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models) achieved a Rank 4 in Telugu and Rank 2 in Hindi with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task - Tamil, Malayalam and Bangla, and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.


Can QE-informed (Re)Translation lead to Error Correction?

Padmanabhan, Govardhan

arXiv.org Artificial Intelligence

The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3 - Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach - QE-informed Retranslation, and compare it with another within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved a Delta COMET score of 0.0201 and -0.0108, respectively, leading the first approach to achieve the winning position on the subtask leaderboard.